Binarized Neural Machine Translation

Neural Information Processing Systems

The rapid scaling of language models is motivating research into low-bitwidth quantization. In this work, we propose BMT, the first binarization technique for Transformer-based machine translation.
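The abstract does not spell out the mechanism, but one-bit weight quantization in this line of work typically binarizes latent full-precision weights with a sign function and trains through it with a straight-through estimator. Below is a minimal PyTorch sketch of such a binarized linear layer; the class name, the XNOR-Net-style scaling, and the initialization are illustrative assumptions, not the authors' BMT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryLinear(nn.Module):
    """Linear layer whose weights are binarized to {-1, +1} in the forward pass.

    Full-precision "latent" weights are kept for the optimizer; gradients
    flow through sign() via a straight-through estimator (STE).
    NOTE: an illustrative sketch, not the BMT paper's code.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Per-row scale alpha = mean(|w|) restores the magnitude lost by sign()
        # (the classic XNOR-Net choice; other scalings are possible).
        alpha = w.abs().mean(dim=1, keepdim=True)
        w_bin = torch.sign(w) * alpha
        # STE: forward uses w_bin, backward treats binarization as identity.
        w_ste = w + (w_bin - w).detach()
        return F.linear(x, w_ste)

# Usage: a drop-in replacement for nn.Linear inside a Transformer block.
layer = BinaryLinear(512, 512)
y = layer(torch.randn(8, 16, 512))  # (batch, seq, features)
```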






Supplementary Material for Accurate Interpolation for Scattered Data through Hierarchical Residual Refinement
Shizhe Ding

Neural Information Processing Systems

In the embedding phase, NIERT uniformly embeds both observed and target points; a learnable mask vector stands in for the value data that target points lack. The core of the NIERT interpolator is a Transformer encoder with a masked self-attention mechanism that uniformly encodes observed and target points while modeling their correlations, which yields superior interpolation accuracy. Adapted to HINT's overall hierarchical framework, the architecture uses residuals on observed points to estimate residuals on target points.

Table 1: Statistics of the interpolation tasks used for training in each dataset.

Theoretical dataset II (Perlin) is another synthetic collection of interpolation tasks, designed for the numerical interpolation of two-dimensional rough functions.
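To make the embedding and masking scheme concrete, here is a minimal PyTorch sketch of how one might uniformly embed observed points (position plus value) and target points (position plus a learnable mask vector), and build an attention mask in which every point attends only to observed points. All names, dimensions, and the exact masking convention are assumptions for illustration, not NIERT's or HINT's released code.

```python
import torch
import torch.nn as nn

class UniformPointEmbedding(nn.Module):
    """Embed observed points (position + value) and target points (position +
    learnable mask vector) into one shared sequence, as sketched above."""

    def __init__(self, pos_dim: int, val_dim: int, d_model: int):
        super().__init__()
        self.pos_embed = nn.Linear(pos_dim, d_model)
        self.val_embed = nn.Linear(val_dim, d_model)
        # Learnable placeholder standing in for the missing target values.
        self.mask_vec = nn.Parameter(torch.zeros(d_model))

    def forward(self, obs_pos, obs_val, tgt_pos):
        obs = self.pos_embed(obs_pos) + self.val_embed(obs_val)
        tgt = self.pos_embed(tgt_pos) + self.mask_vec      # broadcasts over points
        return torch.cat([obs, tgt], dim=1)                # (batch, n_obs + n_tgt, d)

def observed_only_attn_mask(n_obs: int, n_tgt: int) -> torch.Tensor:
    """Boolean mask letting every point attend only to observed points, so the
    unknown target values cannot leak into any representation."""
    n = n_obs + n_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, n_obs:] = True   # True = key position disallowed (all target points)
    return mask

# Usage with a stock encoder: pass the mask to TransformerEncoder.forward.
embed = UniformPointEmbedding(pos_dim=2, val_dim=1, d_model=64)
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
x = embed(torch.randn(8, 10, 2), torch.randn(8, 10, 1), torch.randn(8, 5, 2))
h = enc(x, mask=observed_only_attn_mask(10, 5))
```

In PyTorch's convention, a boolean attention mask marks disallowed key positions with True, so masking the target columns keeps both observed and target queries from ever reading a target point's placeholder value.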